Results 1 - 11 of 11
1.
J Biomed Inform ; 146: 104499, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37714418

ABSTRACT

OBJECTIVE: Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, with several related but distinct biomedical concepts often grouped together and treated as a single topic. This study proposes a new method for the automated refinement of subject annotations at the level of MeSH concepts. METHODS: Lacking labelled data, we rely on weak supervision based on concept occurrence in the abstract of an article, enhanced by dictionary-based heuristics. In addition, we investigate deep learning approaches, making design choices to tackle the particular challenges of this task. The new method is evaluated in a large-scale retrospective scenario, based on concepts that have been promoted to descriptors. RESULTS: In our experiments, concept occurrence was the strongest heuristic, achieving a macro-F1 score of about 0.63 across several labels. The proposed method improved on it by more than 4 percentage points. CONCLUSION: The results suggest that concept occurrence is a strong heuristic for refining coarse-grained labels at the level of MeSH concepts, and that the proposed method improves on it further.
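A minimal sketch of the concept-occurrence heuristic named above as the strongest baseline, assuming hypothetical concept identifiers and synonym lists (the paper's dictionary-based enhancements and deep learning models are not reproduced here):

```python
import re

def weak_concept_labels(abstract, concept_terms):
    """Weakly label an article with a MeSH concept when the concept's
    preferred term or any synonym occurs verbatim in the abstract."""
    text = abstract.lower()
    labels = set()
    for concept_id, terms in concept_terms.items():
        for term in terms:
            # Word-boundary match so short terms do not fire inside words.
            if re.search(r"\b" + re.escape(term.lower()) + r"\b", text):
                labels.add(concept_id)
                break
    return labels

# Hypothetical concepts grouped under a single coarse MeSH descriptor.
concept_terms = {
    "M0001": ["dopamine agonist", "dopamine agonists"],
    "M0002": ["dopamine antagonist", "dopamine antagonists"],
}
print(weak_concept_labels("A dopamine agonist was administered.", concept_terms))
# -> {'M0001'}
```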

2.
BMJ Open ; 13(4): e068698, 2023 04 03.
Article in English | MEDLINE | ID: mdl-37012018

ABSTRACT

INTRODUCTION: Mining of electronic health record (EHR) data is increasingly being implemented all over the world but mainly focuses on structured data. The capabilities of artificial intelligence (AI) could reverse the underuse of unstructured EHR data and enhance the quality of medical research and clinical care. This study aims to develop an AI-based model to transform unstructured EHR data into an organised, interpretable dataset and form a national dataset of cardiac patients. METHODS AND ANALYSIS: CardioMining is a retrospective, multicentre study based on large, longitudinal data obtained from unstructured EHRs of the largest tertiary hospitals in Greece. Demographics, hospital administrative data, medical history, medications, laboratory examinations, imaging reports, therapeutic interventions, in-hospital management and postdischarge instructions will be collected, coupled with structured prognostic data from the National Institute of Health. The target number of included patients is 100 000. Natural language processing techniques will facilitate data mining from the unstructured EHRs. The accuracy of the automated model will be compared with manual data extraction by the study investigators. Machine learning tools will provide data analytics. CardioMining aims to cultivate the digital transformation of the national cardiovascular system and to fill the gap in medical recording and big data analysis using validated AI techniques. ETHICS AND DISSEMINATION: This study will be conducted in keeping with the International Conference on Harmonisation Good Clinical Practice guidelines, the Declaration of Helsinki, the Data Protection Code of the European Data Protection Authority and the European General Data Protection Regulation. The Research Ethics Committee of the Aristotle University of Thessaloniki and the Scientific and Ethics Council of the AHEPA University Hospital have approved this study. Study findings will be disseminated through peer-reviewed medical journals and international conferences. International collaborations with other cardiovascular registries will be attempted. TRIAL REGISTRATION NUMBER: NCT05176769.


Subject(s)
Cardiovascular System , Electronic Health Records , Humans , Artificial Intelligence , Retrospective Studies , Research Design , Aftercare , Ecosystem , Patient Discharge , Multicenter Studies as Topic
3.
Artif Intell Med ; 137: 102505, 2023 03.
Article in English | MEDLINE | ID: mdl-36868691

ABSTRACT

Medical Subject Headings (MeSH) is a hierarchically structured thesaurus created by the National Library of Medicine of the USA. Each year the vocabulary is revised, bringing forth different types of changes. Of particular interest are those that introduce new descriptors into the vocabulary, either brand new descriptors or ones that come up as the product of a complex change. These new descriptors often lack ground-truth articles, rendering learning models that require supervision inapplicable. Furthermore, the problem is characterized by its multi-label nature and the fine-grained character of the descriptors that play the role of classes, requiring expert supervision and a lot of human resources. In this work, we alleviate these issues by retrieving insights from provenance information about those descriptors present in MeSH, in order to create a weakly labelled training set for them. At the same time, we use a similarity mechanism to further filter the weak labels obtained through the descriptor information mentioned earlier. Our method, called WeakMeSH, was applied to a large-scale subset of the BioASQ 2018 dataset consisting of 900 thousand biomedical articles. The performance of our method was evaluated on BioASQ 2020 against several other approaches that had given competitive results in similar problems in the past or that apply alternative transformations to the proposed one, as well as against some variants that showcase the importance of each component of our approach. Finally, an analysis was performed on the different MeSH descriptors of each year to assess the applicability of our method to the thesaurus.
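The two-step weak labelling idea described above can be sketched as follows; the article fields, the provenance structure, and the use of TF-IDF with cosine similarity are illustrative assumptions, not WeakMeSH's actual representation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def weak_training_set(articles, new_descriptor, threshold=0.15):
    """articles: dicts with 'text' and 'labels' (existing MeSH descriptors).
    new_descriptor: dict with 'provenance' (the descriptors it emerged from)
    and 'note' (its scope note). Returns indices of weak positives."""
    # Step 1: candidates are articles annotated with a provenance descriptor.
    candidates = [i for i, a in enumerate(articles)
                  if set(a["labels"]) & set(new_descriptor["provenance"])]
    if not candidates:
        return []
    # Step 2: keep candidates sufficiently similar to the new descriptor.
    texts = [articles[i]["text"] for i in candidates]
    vec = TfidfVectorizer().fit(texts + [new_descriptor["note"]])
    sims = cosine_similarity(vec.transform(texts),
                             vec.transform([new_descriptor["note"]])).ravel()
    return [i for i, s in zip(candidates, sims) if s >= threshold]
```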


Subject(s)
Learning , Medical Subject Headings , United States , Humans
4.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36907663

ABSTRACT

The discovery of drug-target interactions (DTIs) is a pivotal process in pharmaceutical development. Computational approaches are a promising and efficient alternative to tedious and costly wet-lab experiments for predicting novel DTIs from numerous candidates. Recently, with the availability of abundant heterogeneous biological information from diverse data sources, computational methods have been able to leverage multiple drug and target similarities to boost the performance of DTI prediction. Similarity integration is an effective and flexible strategy to extract crucial information across complementary similarity views, providing a compressed input for any similarity-based DTI prediction model. However, existing similarity integration methods filter and fuse similarities from a global perspective, neglecting the utility of similarity views for each drug and target. In this study, we propose a Fine-Grained Selective similarity integration approach, called FGS, which employs a local interaction consistency-based weight matrix to capture and exploit the importance of similarities at a finer granularity in both similarity selection and combination steps. We evaluate FGS on five DTI prediction datasets under various prediction settings. Experimental results show that our method not only outperforms similarity integration competitors with comparable computational costs, but also achieves better prediction performance than state-of-the-art DTI prediction approaches by collaborating with conventional base models. Furthermore, case studies on the analysis of similarity weights and on the verification of novel predictions confirm the practical ability of FGS.
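The core idea, weighting similarity views per drug rather than globally, can be illustrated with a toy stand-in; the weight definition below is a simplified consistency score, not FGS's actual local interaction consistency formulation, and all names are assumptions (interactions is a boolean drug-by-target matrix):

```python
import numpy as np

def local_consistency_weights(sim_views, interactions, k=5):
    """Toy per-drug view weights: a view counts more for drug i when i's k
    nearest neighbours in that view have interaction profiles that overlap
    with i's own (mean Jaccard overlap)."""
    m, n, _ = sim_views.shape             # m views, n drugs
    w = np.zeros((m, n))
    for v in range(m):
        for i in range(n):
            nn = np.argsort(-sim_views[v, i])[1:k + 1]   # skip self at rank 0
            inter = (interactions[nn] & interactions[i]).sum(1)
            union = (interactions[nn] | interactions[i]).sum(1)
            w[v, i] = np.mean(inter / np.maximum(union, 1))
    return w / np.maximum(w.sum(axis=0, keepdims=True), 1e-12)

def integrate_similarities(sim_views, weights):
    """Fuse the views with drug-specific weights: row i of the fused matrix
    mixes the views according to drug i's own weight vector."""
    fused = np.zeros(sim_views.shape[1:])
    for v in range(sim_views.shape[0]):
        fused += weights[v][:, None] * sim_views[v]
    return fused
```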


Subject(s)
Drug Development , Drug Discovery , Drug Discovery/methods , Drug Interactions
5.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-36070659

ABSTRACT

The discovery of drug-target interactions (DTIs) is a promising area of research with great potential. The accurate identification of reliable interactions among drugs and proteins via computational methods, which typically leverage heterogeneous information retrieved from diverse data sources, can boost the development of effective pharmaceuticals. Although random walk and matrix factorization techniques are widely used in DTI prediction, they have several limitations. Random walk-based embedding generation is usually conducted in an unsupervised manner, while the linear similarity combination in matrix factorization distorts the individual insights offered by different views. To tackle these issues, we take a multi-layered network approach to handle diverse drug and target similarities, and propose a novel optimization framework, called Multiple similarity DeepWalk-based Matrix Factorization (MDMF), for DTI prediction. The framework unifies embedding generation and interaction prediction, learning vector representations of drugs and targets that not only retain higher-order proximity across all hyper-layers and layer-specific local invariance, but also approximate the interactions with their inner product. Furthermore, we develop an ensemble method (MDMF2A) that integrates two instantiations of the MDMF model, optimizing the area under the precision-recall curve (AUPR) and the area under the receiver operating characteristic curve (AUC), respectively. The empirical study on real-world DTI datasets shows that our method achieves statistically significant improvement over current state-of-the-art approaches in four different settings. Moreover, the validation of highly ranked non-interacting pairs also demonstrates the potential of MDMF2A to discover novel DTIs.
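A sketch of the prediction and ensemble steps under the stated design (inner-product approximation of interactions); the convex combination and its mixing weight are illustrative assumptions, since the abstract does not spell out how the two instantiations are integrated:

```python
import numpy as np

def predict_interactions(drug_emb, target_emb):
    """Score every drug-target pair as the inner product of the learned
    embeddings, matching the approximation described in the abstract."""
    return drug_emb @ target_emb.T

def mdmf2a_scores(aupr_model, auc_model, alpha=0.5):
    """Illustrative ensemble: convex combination of the AUPR-optimized and
    AUC-optimized instantiations; alpha is an assumed mixing weight."""
    s_aupr = predict_interactions(*aupr_model)   # (drug_emb, target_emb) pair
    s_auc = predict_interactions(*auc_model)
    return alpha * s_aupr + (1 - alpha) * s_auc
```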


Subject(s)
Drug Development , Drug Discovery , Algorithms , Drug Discovery/methods , Drug Interactions , Pharmaceutical Preparations , Proteins
6.
IEEE/ACM Trans Comput Biol Bioinform ; 18(4): 1596-1607, 2021.
Article in English | MEDLINE | ID: mdl-31689203

ABSTRACT

Identifying drug-target interactions is crucial for drug discovery. Despite modern technologies used in drug screening, experimental identification of drug-target interactions is an extremely demanding task. Predicting drug-target interactions in silico can thereby facilitate drug discovery as well as drug repositioning. Various machine learning models have been developed over the years to predict such interactions. Multi-output learning models in particular have drawn the attention of the scientific community due to their high predictive performance and computational efficiency. These models are based on the assumption that all the labels are correlated with each other. However, this assumption is too optimistic. Here, we address drug-target interaction prediction as a multi-label classification task that is combined with label partitioning. We show that building multi-output learning models over groups (clusters) of labels often leads to superior results. The performed experiments confirm the efficiency of the proposed framework.
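A compact sketch of the label partitioning idea, with illustrative choices (k-means over label columns, a random forest base model) standing in for whatever the paper actually uses:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def fit_label_partitioned(X, Y, n_clusters=3, seed=0):
    """Group target labels by their interaction profiles (columns of Y),
    then train one multi-output model per label cluster."""
    groups = KMeans(n_clusters=n_clusters, random_state=seed,
                    n_init=10).fit_predict(Y.T)   # one row per label
    models = []
    for g in range(n_clusters):
        cols = np.where(groups == g)[0]
        clf = RandomForestClassifier(random_state=seed)
        clf.fit(X, Y[:, cols])                    # multi-output on the group
        models.append((cols, clf))
    return models

def predict_label_partitioned(models, X, n_labels):
    pred = np.zeros((X.shape[0], n_labels), dtype=int)
    for cols, clf in models:
        pred[:, cols] = np.asarray(clf.predict(X)).reshape(X.shape[0], -1)
    return pred
```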


Subject(s)
Computational Biology/methods , Drug Development/methods , Drug Discovery/methods , Machine Learning
7.
IEEE J Biomed Health Inform ; 23(6): 2230-2237, 2019 11.
Article in English | MEDLINE | ID: mdl-30835232

ABSTRACT

The figures found in biomedical literature are a vital part of biomedical research, education, and clinical decision making. The multitude of their modalities and the lack of corresponding metadata make search and information retrieval a difficult task. In this paper, we introduce novel multi-label modality classification approaches for biomedical figures that do not require segmenting compound figures. In particular, we investigate using both simple and compound figures to train a multi-label model to be used for annotating either all figures or only those predicted as compound by a compound figure detection model. Using data from the medical task of ImageCLEF 2016, we train our approaches with visual features and compare them with the approach involving compound figure separation into sub-figures. Furthermore, we study how multimodal learning from both visual and textual features affects the tasks of classifying biomedical figures by modality and detecting compound figures. Finally, we present a web application for medical figure retrieval, which is based on one of our classification approaches and allows users to search for figures from PubMed Central on any device and to provide feedback about the modality of a figure classified by the system.
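The unsegmented multi-label setup can be sketched as binary relevance over whole-figure feature vectors; the feature extractor and base learner here are illustrative assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# X: one visual feature vector per whole figure (no sub-figure segmentation);
# Y: binary indicator matrix with one column per modality, so a compound
# figure can carry several modality labels at once.
def fit_modality_classifier(X, Y):
    """Binary-relevance multi-label classifier over unsegmented figures."""
    return OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

def predict_modalities(clf, X):
    # Figures predicted with more than one modality are plausible compounds.
    return clf.predict(X)
```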


Subject(s)
Image Interpretation, Computer-Assisted/methods , Image Processing, Computer-Assisted/methods , Information Storage and Retrieval/methods , Machine Learning , Algorithms , Data Mining , Diagnostic Imaging , Humans
8.
J Biomed Inform ; 92: 103118, 2019 04.
Article in English | MEDLINE | ID: mdl-30753948

ABSTRACT

Biomedical question answering (QA) is a challenging task that has not yet been successfully solved, according to results on international benchmarks such as BioASQ. Recent progress on deep neural networks has led to promising results in domain-independent QA, but the lack of large datasets with biomedical question-answer pairs hinders their successful application to the domain of biomedicine. We propose a novel machine learning-based answer processing approach that exploits neural networks in an unsupervised way through word embeddings. Our approach first combines biomedical and general-purpose tools to identify candidate answers from a set of passages. Candidates are then represented using a combination of features based on both biomedical external resources and input textual sources, including features based on word embeddings. Candidates are finally ranked by the score given at the output of a binary classification model, trained on candidates extracted from a small number of (question, related passages, correct answer) triplets from the BioASQ challenge. Our experimental results show that the use of word embeddings, combined with other features, improves the performance of answer processing in biomedical question answering. In addition, our results show that the use of several annotators improves the identification of answers in passages. Finally, our approach participated in the last two editions (2017, 2018) of the BioASQ challenge, achieving competitive results.
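A sketch of the embedding-based candidate scoring described above, under simplifying assumptions (averaged word vectors and a single cosine feature; the paper combines many more features from biomedical resources):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def candidate_feature(question_vecs, candidate_vecs):
    """One embedding feature: cosine similarity between the averaged word
    vectors of the question and of the candidate answer."""
    q = question_vecs.mean(axis=0)
    c = candidate_vecs.mean(axis=0)
    return q @ c / (np.linalg.norm(q) * np.linalg.norm(c) + 1e-12)

def rank_candidates(clf, feature_matrix):
    """Order candidates by the 'correct answer' probability of a binary
    classifier, e.g. clf = LogisticRegression().fit(train_feats, train_y)
    trained on candidates from the BioASQ triplets."""
    scores = clf.predict_proba(feature_matrix)[:, 1]
    return np.argsort(-scores)
```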


Subject(s)
Information Storage and Retrieval/methods , Medical Informatics/methods , Neural Networks, Computer , Unsupervised Machine Learning , Algorithms
9.
J Biomed Semantics ; 8(1): 43, 2017 Sep 22.
Article in English | MEDLINE | ID: mdl-28938902

ABSTRACT

BACKGROUND: In this paper we present the approach that we employed to deal with large-scale multi-label semantic indexing of biomedical papers. This work was mainly carried out within the context of the BioASQ challenge (2013-2017), a challenge concerned with biomedical semantic indexing and question answering. METHODS: Our main contribution is a MUlti-Label Ensemble method (MULE) that incorporates a McNemar statistical significance test in order to validate the combination of the constituent machine learning algorithms. Secondary contributions include a study of the temporal aspects of the BioASQ corpus (the observations also apply to BioASQ's superset, the PubMed article collection) and the proper parametrization of the algorithms used to deal with this challenging classification task. RESULTS: The ensemble method that we developed is compared to other approaches in experimental scenarios with subsets of the BioASQ corpus, giving positive results. In our participation in the BioASQ challenge we obtained the first place in 2013 and the second place in the four following years, steadily outperforming MTI, the indexing system of the National Library of Medicine (NLM). CONCLUSIONS: The results of our experimental comparisons suggest that employing a statistical significance test to validate the ensemble method's choices is the optimal approach for ensembling multi-label classifiers, especially in contexts with many rare labels.
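A minimal sketch of the McNemar check at the heart of MULE's validation step, here as an exact test on paired predictions; how the ensemble then uses the outcome is the paper's contribution and is not reproduced:

```python
from scipy.stats import binomtest

def mcnemar_differs(y_true, pred_a, pred_b, alpha=0.05):
    """Exact McNemar test on paired predictions (e.g. flattened label-wise
    decisions of two multi-label classifiers): do the two models make
    significantly different errors on the same instances?"""
    b = sum(a == t != p for t, a, p in zip(y_true, pred_a, pred_b))  # A right, B wrong
    c = sum(a != t == p for t, a, p in zip(y_true, pred_a, pred_b))  # A wrong, B right
    if b + c == 0:
        return False                      # identical error patterns
    return binomtest(b, b + c, 0.5).pvalue < alpha
```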


Subject(s)
Abstracting and Indexing/methods , Biomedical Research , Machine Learning , Models, Statistical , Semantics
10.
BMC Bioinformatics ; 17 Suppl 5: 173, 2016 Jun 06.
Article in English | MEDLINE | ID: mdl-27295298

ABSTRACT

BACKGROUND: Somatic Hypermutation (SHM) refers to the introduction of mutations within rearranged V(D)J genes, a process that increases the diversity of Immunoglobulins (IGs). The analysis of SHM has offered critical insight into the physiology and pathology of B cells, leading to strong prognostic markers for clinical outcome in chronic lymphocytic leukaemia (CLL), the most frequent adult B-cell malignancy. In this paper we present a methodology for integrating multiple immunogenetic and clinicobiological data sources in order to extract features and create high-quality datasets for SHM analysis in IG receptors of CLL patients. This dataset is used as the basis for a higher-level integration procedure, inspired by social choice theory. This is applied in the Towards analysis, our attempt to investigate the potential ontogenetic transformation of genes belonging to specific stereotyped CLL subsets towards other genes or gene families, through SHM. RESULTS: The data integration process, followed by feature extraction, resulted in the generation of a dataset containing information about mutations occurring through SHM. The Towards analysis, performed on the integrated dataset using voting techniques, revealed the distinct behaviour of subset #201 compared to other subsets, as regards SHM-related movements among gene clans, in both allele-conserved and non-conserved gene areas. With respect to movement between genes, a high percentage of movements towards pseudogenes was found in all CLL subsets. CONCLUSIONS: This data integration and feature extraction process can set the basis for exploratory analysis or for a fully automated computational data mining approach to many as yet unanswered, clinically relevant biological questions.
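The abstract does not specify which voting techniques were applied; as an illustration of social-choice-style aggregation, a Borda count over per-source rankings might look like this (the clan names are hypothetical):

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate several rankings (best first) into one consensus ordering:
    an item in position p of a ranking of length n earns n - 1 - p points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for pos, item in enumerate(ranking):
            scores[item] += n - 1 - pos
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-source rankings of gene clans a subset's sequences
# appear to "move towards" through SHM.
print(borda_count([["clan III", "clan I", "clan II"],
                   ["clan I", "clan III", "clan II"]]))
```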


Subject(s)
Immunogenetics/methods , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Somatic Hypermutation, Immunoglobulin/genetics , Adult , Databases, Genetic , Female , Germ-Line Mutation , Humans , Immunoglobulin Variable Region/genetics , Immunoglobulins/genetics , Leukemia, Lymphocytic, Chronic, B-Cell/pathology
11.
J Hered ; 106(5): 672-6, 2015.
Article in English | MEDLINE | ID: mdl-26137847

ABSTRACT

The advent of high-throughput genomic technologies is enabling analyses of thousands or even millions of single-nucleotide polymorphisms (SNPs). At the same time, the selection of a minimum number of SNPs with the maximum information content is becoming increasingly problematic. Available locus-ranking programs have been criticized for providing upwardly biased results (concerning the predicted accuracy of the chosen set of markers for population assignment), cannot handle high-dimensional datasets, and in some cases are computationally intensive. The Toolbox for Ranking and Evaluation of SNPs (TRES) is a collection of algorithms built into user-friendly and computationally efficient software that can manipulate and analyze datasets on the order of millions of genotypes in a matter of seconds. It offers a variety of established methods for evaluating and ranking SNPs on user-defined groups of populations and produces a predefined number of top-ranked loci. Moreover, dataset manipulation algorithms enable users to convert datasets between file formats, split the initial datasets into training and test sets, and create datasets containing only the SNPs selected by the analysis, for later evaluation in dedicated software such as GENECLASS. This application can help biologists select loci with maximum power for the optimization of cost-effective panels, with applications related to e.g. species identification, wildlife management, and forensic problems. TRES is available for all operating systems at http://mlkd.csd.auth.gr/bio/tres.
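As an illustration of the kind of established ranking criterion such a toolbox offers, a simple allele-frequency-difference (delta) ranking might look like this; this is a sketch, not TRES's actual code:

```python
import numpy as np

def rank_snps_by_delta(genotypes, pop_labels, top_k=100):
    """Rank SNPs by the absolute allele-frequency difference (delta) between
    two user-defined population groups; genotypes is an (individuals x SNPs)
    matrix of minor-allele counts in {0, 1, 2}."""
    genotypes = np.asarray(genotypes, dtype=float)
    pop_labels = np.asarray(pop_labels)
    pops = np.unique(pop_labels)
    assert len(pops) == 2, "delta is defined here for exactly two groups"
    freqs = [genotypes[pop_labels == p].mean(axis=0) / 2.0 for p in pops]
    delta = np.abs(freqs[0] - freqs[1])
    return np.argsort(-delta)[:top_k]     # indices of the top-ranked loci
```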


Subject(s)
Genetics, Population/methods , Genomics/methods , Polymorphism, Single Nucleotide , Software , Algorithms , Genotype